Exploratory Data Analysis for Red Wine Quality by Xiangming Zeng

Introduction

In this project, we will do exploratory data analysis using the Red Wine Quality data set.

This data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). This variable dictionary explains the variables in the data set and how the data was collected.

Univariate Plots Section

First, we need a basic understanding of the dataset. Its dimensions are:

## [1] 1599   13

The variables in this data set are:

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

The structure of this data set is as follows:

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

As the structure shows, the X variable is simply the row index of the data set. The quality ratings that actually occur in the data are:

## [1] 5 6 7 4 8 3
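The outputs above are the kind of thing base R's `dim()`, `names()`, `str()` and `unique()` print. The following is a minimal sketch of those calls, not the code behind this report. To stay runnable without the data file, it rebuilds only the first ten rows of a few columns from the `str()` output above, so its results differ from the full 1,599-row output. The file name in the comment is an assumption.

```r
# First ten rows of a few columns, copied from the str() output above.
# With the full file one would load all 1,599 rows instead, e.g.
#   wine <- read.csv("wineQualityReds.csv")   # hypothetical file name
wine <- data.frame(
  X             = 1:10,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dim(wine)             # number of rows and columns
names(wine)           # column names
str(wine)             # type and first values of each column
unique(wine$quality)  # distinct quality ratings, in order of first appearance
```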

The basic statistics of the data set are as follows:

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Generally, fixed acidity (median 7.90) is much larger than volatile acidity (median 0.52). The density of red wine lies in a narrow range, with a minimum of 0.9901, a maximum of 1.0037, and a mean of 0.9967. Alcohol ranges from 8.40 to 14.90, while quality ranges from 3 to 8.
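As a rough illustration of how such comparisons can be read off in code, the sketch below runs `summary()` on three columns and compares the two acidity medians. It uses only the first ten rows printed in the `str()` output, so the numbers will not match the full-data table above.

```r
# First ten rows of three columns, copied from the str() output above.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Min, quartiles, median, mean and max for every column
summary(wine)

# Median of each acidity measure, and how many times larger fixed acidity is
med <- sapply(wine[, c("fixed.acidity", "volatile.acidity")], median)
med
med[["fixed.acidity"]] / med[["volatile.acidity"]]

# Smallest and largest alcohol value in the sample
range(wine$alcohol)
```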

I plot the distribution of red wine quality to see what it looks like.

Most wines fall into quality levels 5 and 6, and the distribution looks roughly normal.
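Because quality is a discrete rating, one simple way to look at its distribution is to count wines per level and draw a bar chart. The sketch below uses base R graphics, which may differ from the plotting library used for this report, and only the ten quality values shown in the `str()` output.

```r
# First ten quality ratings, copied from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Use the levels 3..8 reported for the full data set so that
# ratings absent from this small sample still appear with a count of zero.
counts <- table(factor(quality, levels = 3:8))
counts

# One bar per quality level
barplot(counts,
        xlab = "quality",
        ylab = "number of wines",
        main = "Red wine quality (first ten rows only)")
```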

I also plot the distribution of each variable to see whether it follows a normal distribution too.

The distribution of fixed.acidity seems normal.

The distribution of volatile.acidity also seems normal, with a slight right tail.

citric.acid is skewed, with most values concentrated on the left side.

residual.sugar is also highly skewed, with a long right tail.

chlorides has a distribution similar to that of citric.acid.

free.sulfur.dioxide is also skewed.

total.sulfur.dioxide is also skewed.

density is normally distributed.

pH is normally distributed.

sulphates has a long right tail.

alcohol has a long right tail.

While several variables look roughly normal, others have long tails, such as citric.acid, total.sulfur.dioxide and residual.sugar.

For the variables with long tails, I transform them to a log scale and look at the distribution of the log values.

log(citric.acid) does not give a normal distribution. The variable also contains zeros (its minimum is 0.000), for which the log is undefined.

log(residual.sugar) still has a long right tail.

log(chlorides) looks like a normal distribution.

log(free.sulfur.dioxide) seems normal.

log(total.sulfur.dioxide) seems normal.

log(alcohol) shows no big difference.

On a log scale, chlorides, free.sulfur.dioxide and total.sulfur.dioxide look approximately normal. The other variables do not change significantly in shape.
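One way to put a number on "long right tail" before and after the transformation is the sample skewness, which is near zero for a symmetric distribution and positive for a right tail. The sketch below is an illustration only: `skewness()` is a small helper defined here (it is not part of base R and was not necessarily used in this analysis), and the data are the first ten values from the `str()` output.

```r
# Moment-based sample skewness: third central moment over sd^3.
# Hypothetical helper written for this example.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# First ten values of two long-tailed variables, from the str() output above.
residual.sugar       <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)
total.sulfur.dioxide <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

raw <- c(residual.sugar       = skewness(residual.sugar),
         total.sulfur.dioxide = skewness(total.sulfur.dioxide))
logged <- c(residual.sugar       = skewness(log(residual.sugar)),
            total.sulfur.dioxide = skewness(log(total.sulfur.dioxide)))

# Skewness before and after the log transformation
round(rbind(raw, logged), 2)

# Histogram of the log-transformed values
hist(log(total.sulfur.dioxide),
     xlab = "log(total.sulfur.dioxide)",
     main = "Log scale (first ten rows only)")
```

Skewness is unchanged by the choice of log base, so `log()` (natural log) and `log10()` give the same comparison.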

total.acidity, the sum of fixed.acidity and volatile.acidity, seems normal.

Univariate Analysis

What is the structure of your dataset?

There are 1,599 wines in the dataset with 13 features (X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, and quality). The variable X is the index of the dataset. The quality is a factor variable with levels 3, 4, 5, 6, 7, 8.

Other observations:

  • Fixed acidity (median 7.90) is much larger than volatile acidity (median 0.52).
  • The density of red wine lies in a narrow range, with a minimum of 0.9901, a maximum of 1.0037, and a mean of 0.9967.
  • Alcohol ranges from 8.40 to 14.90.

What is/are the main feature(s) of interest in your dataset?

The main features in the dataset are density, pH, and total.sulfur.dioxide. The quality of red wine may be predicted with the main features and some combination of the other variables.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

fixed.acidity, volatile.acidity, sulphates, alcohol, residual.sugar, chlorides and total.sulfur.dioxide may also contribute to the quality of red wine.

Did you create any new variables from existing variables in the dataset?

I created a new variable, total.acidity, by adding fixed.acidity and volatile.acidity together, because I think total acidity may relate to the quality of red wine better than either component alone.
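The derived variable described above is a plain column sum. A minimal sketch on the first ten rows from the `str()` output:

```r
# First ten rows of the two acidity columns, from the str() output above.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
)

# New column: element-wise sum of the two acidity measures
wine$total.acidity <- wine$fixed.acidity + wine$volatile.acidity

head(wine)
```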

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Some of the features have long tails. Therefore, I transformed them (e.g. total.sulfur.dioxide) to a log scale to see whether the log-scaled variables follow a normal distribution more closely. The reason is that the quality variable generally follows a normal distribution, so normally distributed variables may be better suited to predicting the quality of red wine.

Bivariate Plots Section

I would like to see how the different variables correlate with each other, so I plot the following correlation figure.

It looks like some variables are well correlated with each other. For further analysis, I plot some pairs each in a single figure, such as pH vs fixed.acidity and alcohol vs quality. Some of them do show good correlations.

pH seems to be negatively correlated with fixed.acidity.

High quality red wine tends to have high alcohol.

High alcohol red wine seems to have low density.

density is highly correlated with fixed.acidity.

From the above plots, we can see the fixed.acidity and pH seem to be negatively correlated, while alcohol and density also seem negatively correlated. The density and fixed.acidity have a pretty good positive correlation.
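The pairwise relationships discussed above can be summarized with a Pearson correlation matrix and checked with a scatter plot. The sketch below uses base R's `cor()`, `plot()` and `cor.test()` on the first ten rows from the `str()` output. The coefficients from ten rows will differ from those reported for the full data set, and the plotting functions are not necessarily the ones used for the figures in this report.

```r
# First ten rows of four columns, from the str() output above.
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation between every pair of columns
corr <- cor(wine)
round(corr, 2)

# Scatter plot of one pair of interest
plot(wine$fixed.acidity, wine$pH,
     xlab = "fixed.acidity", ylab = "pH")

# Coefficient with a confidence interval and p-value for the same pair
cor.test(wine$fixed.acidity, wine$pH)
```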

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

fixed.acidity is negatively correlated with pH, which makes sense because more acid means a lower pH. fixed.acidity is also positively correlated with density, and density is negatively correlated with pH. quality has its highest correlation with alcohol.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

total.acidity has a very high correlation with fixed.acidity, which is expected because fixed.acidity dominates the sum. Therefore, we may not need the total.acidity variable.

free.sulfur.dioxide and total.sulfur.dioxide are well correlated, with a correlation coefficient of about 0.67.

fixed.acidity and citric.acid also have a positive correlation of 0.67.

density is negatively correlated with alcohol, with a coefficient of -0.50.

What was the strongest relationship you found?

Excluding total.acidity, the variable I added, the strongest relationship is between fixed.acidity and pH, with a correlation coefficient of -0.68.
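Rather than reading the strongest pair off a figure, it can also be located programmatically. The sketch below is one possible approach, not necessarily the one used here: it masks the diagonal and the duplicate upper triangle of the correlation matrix, then picks the entry with the largest absolute value. It runs on the first ten rows from the `str()` output, so the winning pair may differ from the full-data result above.

```r
# First ten rows of five columns, from the str() output above.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

corr <- cor(wine)

# Keep each pair once and drop the self-correlations on the diagonal
corr[upper.tri(corr, diag = TRUE)] <- NA

# Row/column position of the largest absolute correlation
idx <- which(abs(corr) == max(abs(corr), na.rm = TRUE), arr.ind = TRUE)

strongest <- data.frame(
  var1 = rownames(corr)[idx[1, "row"]],
  var2 = colnames(corr)[idx[1, "col"]],
  r    = corr[idx[1, "row"], idx[1, "col"]]
)
strongest
```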

Multivariate Plots Section

I now focus on variables that are correlated with each other and related to the quality of red wine, such as alcohol, density, and pH.

I plot them together to see whether there is any interaction between them, and whether this interaction affects the quality of red wine.

High quality red wine typically has high alcohol and low density.

No obvious relation.

High quality red wine tends to have high alcohol and low pH.

No obvious relation.

Red wines with low density and low pH tend to be of high quality.

Good correlation between pH and fixed.acidity.

High quality red wine tends to have high alcohol and low density.

High quality red wine tends to have high alcohol, low density, and low pH.

High quality red wine tends to have low total.sulfur.dioxide, low density, and high alcohol.

I find that high quality red wines typically combine high alcohol with low density, high fixed.acidity, low pH, or low total.sulfur.dioxide.

High quality red wine tends to have a high alcohol/density ratio.

High quality red wine tends to have a high alcohol/pH ratio.

The next six plots show no obvious relation.

The quality of red wine is better distinguished by the alcohol/pH ratio and the alcohol/density ratio, as shown by the distributions and mean values in the boxplots.
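A ratio feature and its per-quality boxplot can be sketched as follows. This is an illustration on the first ten rows from the `str()` output, using base R graphics rather than whatever produced the figures in this report. Only alcohol/pH is shown because the `str()` output prints just five density values. The column name `alcohol.pH` is an assumption made for this example.

```r
# First ten rows of three columns, from the str() output above.
wine <- data.frame(
  pH      = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Ratio feature (hypothetical column name)
wine$alcohol.pH <- wine$alcohol / wine$pH

# One box per quality level
boxplot(alcohol.pH ~ quality, data = wine,
        xlab = "quality", ylab = "alcohol / pH")

# Mean of the ratio within each quality level, overlaid as crosses
means <- tapply(wine$alcohol.pH, wine$quality, mean)
points(seq_along(means), means, pch = 4)

round(means, 2)
```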

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

High quality red wines typically combine high alcohol with low density, high fixed.acidity, low pH, or low total.sulfur.dioxide.

In short, high quality red wine generally corresponds to low density, low pH, low total.sulfur.dioxide, and high alcohol.

fixed.acidity and pH tend to weaken each other, which makes sense, while fixed.acidity and density tend to strengthen each other.

Were there any interesting or surprising interactions between features?

High quality red wine tends to have high alcohol/pH and alcohol/density ratios, which is very interesting.


Final Plots and Summary

Plot One

Description One

The distribution of red wine quality appears to be roughly normal, with most wines at levels 5 and 6.

Plot Two

Description Two

The quality of red wine is related to alcohol, density, pH and total.sulfur.dioxide. For example, the direct correlation between quality and alcohol is 0.48, while the correlation coefficient between alcohol and density is -0.50, between density and pH is -0.34, and between alcohol and total.sulfur.dioxide is -0.21.

High quality red wine generally corresponds to low density, low pH, low total.sulfur.dioxide, and high alcohol.

Plot Three

Description Three

The quality of red wine is more closely related to the alcohol/pH ratio. From the plot, we can see that the means of alcohol/pH for quality levels 6-8 (about 3.2, 3.5, and 4.2) are generally greater than those for levels 3-5 (about 2.9, 3.0, and 2.9). Generally, high quality red wine has a high alcohol/pH ratio.


Reflection

The red wine data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

The question I am trying to answer is which factors affect the quality of red wine. I started by understanding the basic structure and variables of the dataset, and then explored the distributions of the different variables. I found that the quality of red wine in the dataset generally follows a normal distribution, and that many other variables, such as pH, have a similar distribution. For some non-normally distributed variables, I transformed them to a log scale to better understand their distributions. Some variables in the dataset are correlated, such as fixed.acidity and pH, which makes sense. After many explorations with different variables, I found that the most important variables for red wine quality are alcohol, density, and pH. Boxplots of the alcohol/pH and alcohol/density ratios for each quality level make the relations between these variables easy to see: high quality red wine generally has high alcohol/pH and alcohol/density ratios.

For future work, one can use the factors that are important for the red wine quality to build a model and predict the quality of red wine.
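As a starting point for that future work, the sketch below fits an ordinary linear regression of quality on alcohol and pH. The choice of `lm()`, the two predictors, and treating quality as a numeric response are all assumptions made for illustration. No model was fitted in this analysis, and ten rows from the `str()` output are far too few for a meaningful fit.

```r
# First ten rows of three columns, from the str() output above.
wine <- data.frame(
  pH      = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Linear model: quality explained by alcohol and pH
fit <- lm(quality ~ alcohol + pH, data = wine)
coef(fit)

# Predicted quality for a hypothetical wine (values invented for the example)
new_wine <- data.frame(alcohol = 10, pH = 3.3)
pred <- predict(fit, newdata = new_wine)
pred
```

With the full data set, one would also hold out part of the wines to check how well the predictions generalize.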